Abstract — In this paper, we design distributed spectrum access mechanisms with both complete and incomplete network information. We propose an evolutionary spectrum access mechanism with complete network information, and show that the mechanism achieves an equilibrium that is globally evolutionarily stable. With incomplete network information, we propose a distributed learning mechanism, where each user utilizes local observations to estimate the expected throughput and learns to adjust its spectrum access strategy adaptively over time. We show that the learning mechanism converges to the same evolutionary equilibrium on the time average. Numerical results show that the proposed mechanisms achieve up to 35% performance improvement over the distributed reinforcement learning mechanism in the literature, and are robust to the perturbations of users' channel selections.



I. Introduction 

Cognitive radio is envisioned as a promising technique to alleviate the problem of spectrum under-utilization [1]. It enables unlicensed wireless users (secondary users) to opportunistically access the licensed channels owned by spectrum holders (primary users), and thus can significantly improve the spectrum efficiency [2].

A key challenge of the cognitive radio technology is how to share the spectrum resources efficiently in a distributed fashion. A common modeling approach is to consider selfish secondary users and model their interactions as noncooperative games (e.g., [3]-[8]). Liu and Wu [5] modeled the interactions among spatially separated secondary users as congestion games with resource reuse. Elias et al. [6] studied the competitive spectrum access by multiple interference-sensitive secondary users. Nie and Comaniciu [7] designed a self-enforcing distributed spectrum access mechanism based on potential games. Law et al. [8] studied the price of anarchy of the spectrum access game, and showed that users' selfish choices may significantly degrade system performance. A common assumption of the above results is that each user knows the complete network information. This is, however, often expensive or infeasible to achieve due to significant signaling overhead and the competitors' unwillingness to share information.

Another common assumption of all the above work is 
that secondary users are fully rational and thus often adopt 
their channel selections based on best responses, i.e., the best 
choices they can compute by having the complete network 
information. To have full rationality, a user needs to have a 
high computational power to collect and analyze the network 
information in order to predict other users' behaviors. This is 
often not feasible due to the limitations of today's wireless 
devices. 
Another body of related work focused on the design of spectrum access mechanisms assuming bounded rationality of secondary users, i.e., each user tries to improve its strategy adaptively over time. Chen and Huang [9] designed an imitation-based spectrum access mechanism by letting secondary users imitate other users' successful channel selections. When not knowing the channel selections of other users, secondary users need to learn the environment and adapt their channel selection decisions accordingly. The authors in [10], [11] used no-regret learning to solve this problem, assuming that the users' channel selections are common information. The learning converges to a correlated equilibrium [12], wherein the commonly observed history serves as a signal to coordinate all users' channel selections. When users' channel selections are not observable, the authors in [13]-[15] designed multi-agent multi-armed bandit learning algorithms to minimize the expected performance loss of distributed spectrum access. Li [16] applied reinforcement learning to analyze Aloha-type spectrum access.

In this paper, we propose a new framework of distributed spectrum access with and without complete network information (i.e., channel statistics and user selections). A common characteristic of the algorithms under this framework is again bounded rationality, which requires much less computational power than full rationality, and thus may better match the reality of wireless communications. We first propose an evolutionary game approach for distributed spectrum access with complete network information, where each secondary user takes a comparison strategy (i.e., compares its payoff with the system average payoff) to evolve its spectrum access decision over time. We then propose a learning mechanism for distributed spectrum access with incomplete information, which does not require any prior knowledge of channel statistics or information exchange among users. In this case, each secondary user estimates its expected throughput locally, and learns to adjust its channel selection strategy adaptively.

The main results and contributions of this paper are as 
follows: 

• Evolutionary spectrum access mechanism: we formulate the distributed spectrum access over multiple heterogeneous time-varying licensed channels as an evolutionary spectrum access game, and study the evolutionary dynamics of spectrum access.

• Evolutionary dynamics and stability: we show that the evolutionary spectrum access mechanism converges to the evolutionary equilibrium, and prove that it is globally evolutionarily stable.

• Learning mechanism with incomplete information: we further propose a learning mechanism that requires neither knowledge of channel statistics nor information exchange among users. We show that the learning mechanism converges to the same evolutionary equilibrium on the time average.

• Superior performance: we show that the proposed mechanisms can achieve up to 35% performance improvement over the distributed reinforcement learning mechanism in the literature, and are robust to the perturbations of users' channel selections.

The rest of the paper is organized as follows. We introduce the system model in Section II. After briefly reviewing evolutionary game theory in Section III, we present the evolutionary spectrum access mechanism with complete information in Section IV. Then we introduce the learning mechanism in Section V. We illustrate the performance of the proposed mechanisms through numerical results in Section VI, and finally conclude in Section VII.



[Figure 1 panels: Channel Sensing, Channel Contention, Data Transmission, Channel Selection]

II. System Model 

We consider a cognitive radio network with a set $\mathcal{M} = \{1, 2, \ldots, M\}$ of independent and stochastically heterogeneous licensed channels. A set $\mathcal{N} = \{1, 2, \ldots, N\}$ of secondary users try to opportunistically access these channels when the channels are not occupied by primary (licensed) transmissions. The system model has a slotted transmission structure as in Figure 1 and is described as follows.

• Channel State: the channel state of a channel $m$ at time slot $t$ is
$$S_m(t) = \begin{cases} 0, & \text{if channel } m \text{ is occupied by primary transmissions},\\ 1, & \text{if channel } m \text{ is idle}. \end{cases}$$

• Channel State Changing: for a channel $m$, we assume that the channel state is an i.i.d. Bernoulli random variable, with an idle probability $\theta_m \in (0, 1)$ and a busy probability $1 - \theta_m$. This model can be a good approximation of reality if the time slots for secondary transmissions are sufficiently long or the primary transmissions are highly bursty [17]. Numerical results show that the proposed mechanisms also work well in a Markovian channel environment.

• Heterogeneous Channel Throughput: if a channel $m$ is idle, the achievable data rate $b_m(t)$ of a secondary user in each time slot $t$ evolves according to an i.i.d. random process with mean $B_m$, due to local environmental effects such as fading. For example, in a frequency-selective Rayleigh fading channel environment, we can compute the channel data rate according to the Shannon capacity, with the channel gain in a time slot being a realization of a random variable that follows the exponential distribution [18].

• Time Slot Structure: each secondary user $n$ executes the following stages synchronously during each time slot:

- Channel Sensing: sense one of the channels based on the channel selection decision generated at the end of the previous time slot. Access the channel if it is idle.

- Channel Contention: use a backoff mechanism to resolve collisions when multiple secondary users access the same idle channel. The contention stage of a time slot is divided into $\lambda_{\max}$ mini-slots¹ (see Figure 1), and user $n$ executes the following two steps. First, count down according to a randomly and uniformly chosen integer backoff time (number of mini-slots) $\lambda_n$ between 1 and $\lambda_{\max}$. Second, once the timer expires, transmit RTS/CTS messages if the channel is clear (i.e., no ongoing transmission). Note that if multiple users choose the same smallest backoff value $\lambda_n$, a collision will occur with the RTS/CTS transmissions and no user can successfully grab the channel.

- Data Transmission: transmit data packets if the RTS/CTS message exchanges go through and the user successfully grabs the channel.

- Channel Selection: in the complete information case, users broadcast the chosen channel IDs to other users through a common control channel² and then make the channel selection decisions based on the evolutionary spectrum access mechanism (details in Section IV). With incomplete information, users update the channel estimations based on the current access results, and make the channel selection decisions according to the distributed learning mechanism (details in Section V).

Fig. 1. Multiple stages in a single time slot.

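Suppose that $k_m$ users choose an idle channel $m$ to access. Then the probability that a user $n$ (out of the $k_m$ users) grabs the channel $m$ is
$$g(k_m) = \Pr\{\lambda_n < \min_{i \neq n}\{\lambda_i\}\} = \sum_{\lambda=1}^{\lambda_{\max}} \Pr\{\lambda_n = \lambda\}\Pr\{\lambda < \min_{i \neq n}\{\lambda_i\} \mid \lambda_n = \lambda\} = \sum_{\lambda=1}^{\lambda_{\max}} \frac{1}{\lambda_{\max}}\left(\frac{\lambda_{\max} - \lambda}{\lambda_{\max}}\right)^{k_m - 1},$$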

which is a decreasing function of the total number of contending users $k_m$. Then the expected throughput of a secondary user $n$ choosing a channel $m$ is given as
$$U_n = \theta_m B_m g(k_m). \tag{1}$$
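To make the contention model concrete, the Python sketch below evaluates the closed form for $g(k)$ above, cross-checks it with a direct Monte Carlo simulation of the uniform integer backoff, and evaluates the expected throughput in (1). It is a minimal illustration under the stated model (backoff drawn uniformly from $\{1, \ldots, \lambda_{\max}\}$, success only for a unique smallest backoff); the function names are hypothetical and not part of the mechanism.

```python
import random

# Hypothetical helper names; the model follows the backoff description in Section II.


def grab_probability(k: int, lam_max: int) -> float:
    """Closed-form g(k): chance that a given user among k contenders draws a
    backoff strictly smaller than all k - 1 others, with backoffs drawn
    uniformly from {1, ..., lam_max}."""
    return sum(
        (1.0 / lam_max) * ((lam_max - lam) / lam_max) ** (k - 1)
        for lam in range(1, lam_max + 1)
    )


def grab_probability_mc(k: int, lam_max: int, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of g(k) by simulating the backoff contention."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        backoffs = [rng.randint(1, lam_max) for _ in range(k)]
        # User 0 succeeds only if its timer expires strictly before everyone else's;
        # a tie at the smallest value is an RTS/CTS collision.
        if all(backoffs[0] < other for other in backoffs[1:]):
            wins += 1
    return wins / trials


def expected_throughput(theta_m: float, B_m: float, k_m: int, lam_max: int) -> float:
    """Eq. (1): U_n = theta_m * B_m * g(k_m)."""
    return theta_m * B_m * grab_probability(k_m, lam_max)
```

For example, with $\lambda_{\max} = 20$ the closed form gives $g(2) = 190/400 = 0.475$, and for very large $\lambda_{\max}$ the value of $g(k)$ approaches $1/k$.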



For ease of exposition, we will focus on the analysis of the proposed spectrum access mechanisms in the many-users regime. Numerical results show that our algorithms also apply when the number of users is small (see Section VI-C for the details). Since our analysis is from the secondary users' perspective, we will use the terms "secondary user" and "user" interchangeably.

¹ For ease of exposition, we assume that the contention backoff size $\lambda_{\max}$ is fixed. This corresponds to an equilibrium model for the case that the backoff size $\lambda_{\max}$ can be dynamically tuned according to the 802.11 distributed coordination function [19]. Also, we can enhance the performance of the backoff mechanism by determining the optimal fixed contention backoff size according to the method in [20].

² Please refer to [21] for the details on how to set up and maintain a reliable common control channel in cognitive radio networks.

III. Overview of Evolutionary Game Theory 

For the sake of completeness, we briefly describe the background of evolutionary game theory. A detailed introduction can be found in [22]. Evolutionary game theory was first used in biology to study the change of animal populations, and was later applied in economics to model human behaviors. It is most useful for understanding how a large population of users converges to Nash equilibria in a dynamic system [22]. A player in an evolutionary game has bounded rationality, i.e., limited computational capability and knowledge, and improves its decisions as it learns about the environment over time [22].

The evolutionarily stable strategy (ESS) is a key concept to describe the evolutionary equilibrium. For simplicity, we will introduce the ESS definition (and the strict Nash equilibrium definition in Definition 2) in the context of a symmetric game where all users adopt the same strategy $i$ at the ESS (strict Nash equilibrium, respectively). The definition can be (and will be) extended to the case of an asymmetric game [22], where we view the population's collective behavior as a mixed strategy $i$ at the ESS (strict Nash equilibrium, respectively).

An ESS ensures stability in the sense that the population is robust to perturbations by a small fraction of players. Suppose that a small share $\epsilon$ of players in the population deviate to choose a mutant strategy $j$, while all other players stick to the incumbent strategy $i$. We denote the population state of the game as $\mathbf{x}_{(1-\epsilon)i+\epsilon j} = (x_i = 1-\epsilon, x_j = \epsilon, x_l = 0, \forall l \neq i, j)$, where $x_a$ denotes the fraction of users choosing strategy $a$, and denote the corresponding payoff of choosing strategy $a$ as $R(a, \mathbf{x}_{(1-\epsilon)i+\epsilon j})$.

Definition 1 ([22]). A strategy $i$ is an evolutionarily stable strategy if for every strategy $j \neq i$, there exists an $\bar{\epsilon} \in (0, 1)$ such that $R(i, \mathbf{x}_{(1-\epsilon)i+\epsilon j}) > R(j, \mathbf{x}_{(1-\epsilon)i+\epsilon j})$ for any $\epsilon \in (0, \bar{\epsilon})$.

Definition 1 means that the mutant strategy $j$ cannot invade the population when the perturbation is small enough, if the incumbent strategy $i$ is an ESS. It is shown in [22] that any strict Nash equilibrium in noncooperative games is also an ESS.

Definition 2 ([22]). A strategy $i$ is a strict Nash equilibrium if for every strategy $j \neq i$, it satisfies that $R(i, i, \ldots, i) > R(j, i, \ldots, i)$, where $R(a, i, \ldots, i)$ denotes the payoff of choosing strategy $a \in \{i, j\}$ given that the other players adhere to the strategy $i$.

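To see why a strict Nash equilibrium is an ESS, let $\epsilon \to 0$ in Definition 1. The population state then approaches $\mathbf{x}_i$ (all players on strategy $i$), at which the strict Nash condition gives $R(i, \mathbf{x}_i) > R(j, \mathbf{x}_i), \forall j \neq i$, i.e., given that almost all other players play the incumbent strategy $i$, choosing any mutant strategy $j \neq i$ will lead to a loss in payoff. If the payoff is continuous in the population state, this strict inequality persists for all sufficiently small $\epsilon$, which is exactly the condition in Definition 1.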
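Definition 1 can be checked numerically for small games. The sketch below tests the invasion condition on a grid of small mutant shares $\epsilon$ for a symmetric two-player matrix game, where $R(a, x)$ is the expected payoff of pure strategy $a$ against population state $x$. The payoff matrix is a hypothetical Prisoner's Dilemma, not the spectrum access game, and the helper names are ours; a finite grid only provides evidence, not a proof.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def payoff(A: Matrix, a: int, x: Sequence[float]) -> float:
    """R(a, x): expected payoff of pure strategy a against population state x,
    where A[a][b] is the payoff of playing a against an opponent playing b."""
    return sum(A[a][b] * x[b] for b in range(len(x)))


def mixed_state(i: int, j: int, eps: float, n: int) -> List[float]:
    """Population state x_{(1-eps)i + eps j}: share 1 - eps on incumbent i,
    share eps on mutant j, and 0 on every other strategy."""
    x = [0.0] * n
    x[i] += 1.0 - eps
    x[j] += eps
    return x


def looks_like_ess(A: Matrix, i: int, eps_grid=(1e-1, 1e-2, 1e-3, 1e-4)) -> bool:
    """Check the invasion condition of Definition 1 on a finite grid of eps.
    Passing is evidence, not proof: the definition covers all eps in (0, eps_bar)."""
    n = len(A)
    for j in range(n):
        if j == i:
            continue
        for eps in eps_grid:
            x = mixed_state(i, j, eps, n)
            if not payoff(A, i, x) > payoff(A, j, x):
                return False
    return True


# Hypothetical Prisoner's Dilemma: strategy 0 = cooperate, 1 = defect.
A = [[3.0, 0.0], [5.0, 1.0]]
print(looks_like_ess(A, 1), looks_like_ess(A, 0))  # True False
```

Here defection is a strict Nash equilibrium and passes the check, while cooperation fails it because a small share of defecting mutants earns more than the incumbents.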

Several recent results applied evolutionary game theory to study various networking problems. Niyato and Hossain [?] investigated the evolutionary dynamics of heterogeneous network selections. Zhang et al. [23] designed incentive schemes for resource sharing in P2P networks based on evolutionary game theory. Wang et al. [24] proposed an evolutionary game approach for collaborative spectrum sensing mechanism design in cognitive radio networks. According to Definition 1, the ESS obtained in [?], [23], [24] is locally evolutionarily stable (i.e., the mutation $\epsilon$ must be small enough). Here we apply evolutionary game theory to design a spectrum access mechanism that can achieve global evolutionary stability (i.e., the mutation $\epsilon$ can be arbitrarily large).

Algorithm 1 Evolutionary Spectrum Access Mechanism
1: initialization:
2:   set the global strategy adaptation factor $\alpha \in (0, 1]$.
3:   select a random channel for each user.
4: end initialization
5: loop for each time slot $t$ and each user $n \in \mathcal{N}$ in parallel:
6:   sense and contend for the chosen channel, and transmit data packets if successfully grabbing the channel.
7:   broadcast the chosen channel ID to other users through a common control channel.
8:   receive the information of other users' channel selections and calculate the population state $x(t)$.
9:   compute the expected payoff $U_n(a_n, x(t))$ and the system average payoff $U(x(t))$ according to (2) and (3), respectively.
10:  if $U_n(a_n, x(t)) < U(x(t))$ then
11:    generate a random value $\delta$ according to a uniform distribution on $(0, 1)$.
12:    if $\delta < \alpha\left(1 - U_n(a_n, x(t))/U(x(t))\right)$ then
13:      select a better channel $m$ with probability
         $p_m = \frac{\max\{\theta_m B_m g(N x_m(t)) - U(x(t)), 0\}}{\sum_{m'=1}^{M} \max\{\theta_{m'} B_{m'} g(N x_{m'}(t)) - U(x(t)), 0\}}$.
14:    else select the original channel.
15:    end if
16:  end if
17: end loop
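The Python sketch below simulates the channel-selection loop of Algorithm 1. As a simplification, it uses the expected payoffs (2) and (3) in place of actual sensing and contention outcomes. Line 12 uses the switching threshold $\alpha(1 - U_n(a_n, x(t))/U(x(t)))$ as written in Algorithm 1; scoring an empty channel with $\theta_m B_m g(1)$ is our assumption, since $g(0)$ is undefined. The function names and channel parameters are hypothetical.

```python
import random


def grab_probability(k: int, lam_max: int) -> float:
    """g(k) for a uniform integer backoff on {1, ..., lam_max}."""
    return sum((1.0 / lam_max) * ((lam_max - lam) / lam_max) ** (k - 1)
               for lam in range(1, lam_max + 1))


def run_evolutionary_access(theta, B, N, alpha=0.5, lam_max=None, iters=300, seed=1):
    """Simulate the channel-selection loop of Algorithm 1 with expected payoffs
    (simplification: no random sensing or contention outcomes).

    theta[m], B[m]: idle probability and mean data rate of channel m.
    lam_max=None uses the asymptotic g(k) = 1/k.
    Returns the final population state x as fractions of users per channel.
    """
    rng = random.Random(seed)
    M = len(theta)

    def g(k: int) -> float:
        return 1.0 / k if lam_max is None else grab_probability(k, lam_max)

    choice = [rng.randrange(M) for _ in range(N)]  # line 3: random initial channels
    for _ in range(iters):
        counts = [0] * M
        for m in choice:
            counts[m] += 1
        # Line 9: expected payoff (2) per channel and arithmetic average (3).
        # Empty channels are scored with g(1) (assumption: g(0) is undefined).
        U = [theta[m] * B[m] * g(max(counts[m], 1)) for m in range(M)]
        U_bar = sum(U) / M
        net_fitness = [max(u - U_bar, 0.0) for u in U]
        if sum(net_fitness) <= 0.0:
            continue  # all payoffs equal (up to rounding): no better channel exists
        new_choice = list(choice)  # every user decides on the same snapshot x(t)
        for n, m in enumerate(choice):
            if U[m] < U_bar:  # line 10
                delta = rng.random()  # line 11
                if delta < alpha * (1.0 - U[m] / U_bar):  # line 12
                    # Line 13: pick a better channel with prob. proportional to net fitness.
                    new_choice[n] = rng.choices(range(M), weights=net_fitness)[0]
        choice = new_choice
    counts = [0] * M
    for m in choice:
        counts[m] += 1
    return [c / N for c in counts]


# Hypothetical channels with theta_m * B_m = 10, 20, 30 and N = 600 users.
print(run_evolutionary_access([0.5, 0.5, 0.5], [20.0, 40.0, 60.0], N=600))
```

With the asymptotic $g(k) = 1/k$, users settle near the split that equalizes $\theta_m B_m / (N x_m)$ across channels, here roughly $(1/6, 1/3, 1/2)$. Passing an integer `lam_max` switches to the finite-backoff $g(k)$.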

IV. Evolutionary Spectrum Access 

We now apply the evolutionary game theory to design an 
efficient and stable spectrum access mechanism with complete 
network information. We will show that the spectrum access 
equilibrium is an ESS, which guarantees that the spectrum 
access mechanism is robust to random perturbations of users' 
channel selections. 

A. Evolutionary Game Formulation 

The evolutionary spectrum access game is formulated as 
follows: 

• Players: the set of users $\mathcal{N} = \{1, 2, \ldots, N\}$.

• Strategies: each user can access one of the set of channels $\mathcal{M} = \{1, 2, \ldots, M\}$.

• Population state: the user distribution over the $M$ channels at time $t$, $x(t) = (x_m(t), \forall m \in \mathcal{M})$, where $x_m(t)$ is the proportion of users selecting channel $m$ at time $t$. We have $\sum_{m \in \mathcal{M}} x_m(t) = 1$ for all $t$.

• Payoff: a user $n$'s expected throughput $U_n(a_n, x(t))$ when choosing channel $a_n \in \mathcal{M}$, given that the population state is $x(t)$. Since each user has the information of channel statistics, from (1), we have
$$U_n(a_n, x(t)) = \theta_{a_n} B_{a_n} g(N x_{a_n}(t)). \tag{2}$$
We also denote the system arithmetic average payoff under population state $x(t)$ as
$$U(x(t)) = \frac{1}{M}\sum_{m=1}^{M} \theta_m B_m g(N x_m(t)). \tag{3}$$



B. Evolutionary Dynamics 

Based on the evolutionary game formulation above, we propose an evolutionary spectrum access mechanism in Algorithm 1 by reverse-engineering the replicator dynamics. The idea is to let those users whose payoffs are lower than the system average payoff $U(x(t))$ select a better channel, with a probability proportional to the (normalized) channel's "net fitness" $\theta_m B_m g(N x_m(t)) - U(x(t))$. We show that the dynamics of channel selections in the mechanism can be described by the evolutionary dynamics in (4). The proof is given in the Appendix.

Theorem 1. For the evolutionary spectrum access mechanism in Algorithm 1, the evolutionary dynamics are given as
$$\dot{x}_m(t) = \alpha x_m(t)\left(\frac{U_n(m, x(t))}{U(x(t))} - 1\right), \forall m \in \mathcal{M}, \tag{4}$$
where the derivative is with respect to time $t$.



C. Evolutionary Equilibrium in Asymptotic Case $\lambda_{\max} = \infty$

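We next investigate the equilibrium of the evolutionary spectrum access mechanism. To obtain useful insights, we first focus on the asymptotic case where the number of backoff mini-slots $\lambda_{\max}$ goes to $\infty$, such that
$$g(k) = \lim_{\lambda_{\max} \to \infty} \sum_{\lambda=1}^{\lambda_{\max}} \frac{1}{\lambda_{\max}}\left(\frac{\lambda_{\max} - \lambda}{\lambda_{\max}}\right)^{k-1} = \lim_{\lambda_{\max} \to \infty} \sum_{\lambda=0}^{\lambda_{\max}-1} \frac{1}{\lambda_{\max}}\left(\frac{\lambda}{\lambda_{\max}}\right)^{k-1} = \int_0^1 z^{k-1}\,dz = \frac{1}{k}. \tag{5}$$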



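This is a good approximation when the number of mini-slots $\lambda_{\max}$ for backoff is much larger than the number of users $N$ and collisions rarely occur. In this case, $U_n(a_n, x(t)) = \frac{\theta_{a_n} B_{a_n}}{N x_{a_n}(t)}$ and $U(x(t)) = \frac{1}{M}\sum_{i=1}^{M} \frac{\theta_i B_i}{N x_i(t)}$. According to Theorem 1, the evolutionary dynamics in (4) become
$$\dot{x}_m(t) = \alpha x_m(t)\left(\frac{\theta_m B_m / x_m(t)}{\frac{1}{M}\sum_{i=1}^{M} \theta_i B_i / x_i(t)} - 1\right). \tag{6}$$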



From (6), we have

Theorem 2. The evolutionary spectrum access mechanism in the asymptotic case $\lambda_{\max} = \infty$ globally converges to the evolutionary equilibrium
$$x_m^* = \frac{\theta_m B_m}{\sum_{i=1}^{M} \theta_i B_i}, \forall m \in \mathcal{M}.$$

The proof is given in the Appendix. Theorem 2 implies the following.

Corollary 1. The evolutionary spectrum access mechanism converges to the equilibrium $x^*$ such that users on different channels achieve the same expected throughput, i.e.,
$$U_n(m, x^*) = U_n(m', x^*), \forall m, m' \in \mathcal{M}. \tag{7}$$
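For the asymptotic case, Theorem 2 and Corollary 1 can be checked with exact rational arithmetic. The sketch below computes $x_m^* = \theta_m B_m / \sum_i \theta_i B_i$ and the per-user throughput $\theta_m B_m / (N x_m^*)$ from (2) with $g(k) = 1/k$. The rates $B_m$ reuse the values from Section VI, while the idle probabilities $\theta_m$ and the helper names are hypothetical.

```python
from fractions import Fraction


def asymptotic_ess(theta, B):
    """Equilibrium of Theorem 2: x*_m = theta_m B_m / sum_i theta_i B_i."""
    weights = [t * b for t, b in zip(theta, B)]
    total = sum(weights)
    return [w / total for w in weights]


def per_user_throughput(theta, B, x, N):
    """Eq. (2) with g(k) = 1/k: U_n(m, x) = theta_m B_m / (N x_m)."""
    return [t * b / (N * xm) for t, b, xm in zip(theta, B, x)]


# Hypothetical idle probabilities; mean rates (Mbps) as listed in Section VI.
theta = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 4), Fraction(1, 2), Fraction(1, 5)]
B = [15, 70, 90, 20, 100]
N = 100

x_star = asymptotic_ess(theta, B)
U = per_user_throughput(theta, B, x_star, N)
print(x_star)
print(U)  # every channel yields the same per-user throughput, as in (7)
```

Using `Fraction` makes the equality in (7) exact: each entry of `U` equals $\sum_i \theta_i B_i / N$, which is $49/40$ Mbps for these hypothetical values.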



We next show that for the general case $\lambda_{\max} < \infty$, the evolutionary dynamics also globally converge to the ESS equilibrium as given in (7).

D. Evolutionary Equilibrium in General Case $\lambda_{\max} < \infty$

For the general case $\lambda_{\max} < \infty$, since the channel grabbing probability $g(k)$ does not have a closed-form expression, it is difficult to obtain the equilibrium solution of the differential equations in (4). However, it is easy to verify that the equilibrium $x^*$ in (7) is also a stationary point such that the evolutionary dynamics (4) in the general case $\lambda_{\max} < \infty$ satisfy $\dot{x}_m(t) = 0$: when all channels yield the same throughput, $U_n(m, x^*) = U(x^*)$ for every $m$. Thus, at the equilibrium $x^*$, users on different channels achieve the same expected throughput.

We now study the evolutionary stability of the equilibrium. In general, the equilibrium of the replicator dynamics may not be an ESS [22]. For our model, we can prove the following.



Theorem 3. For the evolutionary spectrum access mechanism, the evolutionary equilibrium $x^*$ in (7) is an ESS.



The proof is given in Section VIII-D. In fact, we can obtain a stronger result than Theorem 3. Typically, an ESS is only locally asymptotically stable (i.e., stable within a limited region around the ESS) [22]. In our case, we show that the evolutionary equilibrium $x^*$ is globally asymptotically stable (i.e., stable in the entire feasible region of the population state $x$, $\{x = (x_m, m \in \mathcal{M}) \mid \sum_{m=1}^{M} x_m = 1 \text{ and } x_m \geq 0, \forall m \in \mathcal{M}\}$).

To proceed, we first define the following function
$$L(x) = \sum_{m=1}^{M} \int_0^{x_m} \theta_m B_m g(N z)\,dz. \tag{8}$$



Since $g(\cdot)$ is a decreasing function, it is easy to check that the Hessian matrix of $L(x)$ is negative definite: it is diagonal, with entries $N \theta_m B_m g'(N x_m) < 0$. It follows that $L(x)$ is strictly concave and hence has a unique global maximum $L^*$. By the first-order condition, we obtain the optimal solution $x^*$, which is the same as the evolutionary equilibrium $x^*$ in (7). Then by showing that $V(x(t)) = L^* - L(x(t))$ is a strict Lyapunov function, we have

Theorem 4. For the evolutionary spectrum access mechanism, the evolutionary equilibrium $x^*$ in (7) is globally asymptotically stable.

The proof is given in the Appendix. Since the ESS is glob- 
ally asymptotically stable, the evolutionary spectrum access 
mechanism is robust to any degree of (not necessarily small) 
random perturbations of channel selections. 
[Figure 2 panels: Decision Period 1, Decision Period 2, ..., Decision Period T; each period consists of Time Slot 1, Time Slot 2, ..., Time Slot $t_{\max}$]

Algorithm 2 Learning Mechanism For Distributed Spectrum Access
1: initialization:
2:   set the global memory weight $\gamma \in (0, 1)$ and the set of accessed channels $\mathcal{M}_n = \emptyset$ for each user $n$.
3: end initialization
4: loop for each user $n \in \mathcal{N}$ in parallel:
     ▷ Initial Channel Estimation Stage
5:   while $\mathcal{M}_n \neq \mathcal{M}$ do
6:     choose a channel $m$ from the set $\mathcal{M}_n^c$ randomly.
7:     sense and contend to access the channel $m$ at each time slot of the decision period.
8:     estimate the expected throughput $Z_{m,n}(0)$ by (9).
9:     set $\mathcal{M}_n = \mathcal{M}_n \cup \{m\}$.
10:  end while
     ▷ Access Strategy Learning Stage
11:  for each time period $T$ do
12:    choose a channel $m$ to access according to the mixed strategy $f_n(T)$ in (10).
13:    sense and contend to access the channel $m$ at each time slot of the decision period.
14:    estimate the qualities of the chosen channel $m$ and the unchosen channels $m' \neq m$ by (12) and (11), respectively.
15:  end for
16: end loop



V. Learning Mechanism For Distributed Spectrum Access

For the evolutionary spectrum access mechanism in Section IV, we assume that each user has perfect knowledge of the channel statistics and the population state by information exchange on a common control channel. Such a mechanism leads to significant communication overhead and energy consumption, and may even be impossible in some systems. We thus propose a learning mechanism for distributed spectrum access with incomplete information. The challenge is how to achieve the evolutionarily stable state based only on users' local observations.

A. Learning Mechanism For Distributed Spectrum Access 

The proposed learning process is shown in Algorithm 2 and has two sequential stages: initial channel estimation (lines 5 to 10) and access strategy learning (lines 11 to 15). Each stage is defined over a sequence of decision periods $T = 1, 2, \ldots$, where each decision period consists of $t_{\max}$ time slots (see Figure 2 for an illustration).

The key idea of distributed learning here is to adapt each user's spectrum access decision based on its accumulated experiences. In the first stage, each user initially estimates the expected throughput by accessing all the channels in a randomized round-robin manner. The randomization makes it unlikely that all users choose the same channel in the same period. Let $\mathcal{M}_n$ (equal to $\emptyset$ initially) be the set of channels accessed by user $n$, and $\mathcal{M}_n^c = \mathcal{M} \setminus \mathcal{M}_n$. At the beginning of each decision period, user $n$ randomly chooses a channel $m \in \mathcal{M}_n^c$ (i.e., a channel that has not been accessed before) to access. At the end of the period, user $n$ can estimate the expected throughput by sample averaging as
$$Z_{m,n}(0) = (1-\gamma)\frac{\sum_{t=1}^{t_{\max}} b_m(t) I_{\{a_n(t,T)=m\}}}{t_{\max}}, \tag{9}$$
where $0 < \gamma < 1$ is called the memory weight and $I_{\{a_n(t,T)=m\}}$ is an indicator function that equals 1 if the channel $m$ is idle at time slot $t$ and the user $n$ chooses and successfully grabs the channel $m$. The motivation for multiplying by $(1-\gamma)$ in (9) is to scale down the impact of the noisy instantaneous estimation on the learning. Note that there are $t_{\max}$ time slots within each decision period, and thus the user will be able to obtain a fairly good estimation of the expected throughput if $t_{\max}$ is reasonably large. Then user $n$ updates the set of accessed channels as $\mathcal{M}_n = \mathcal{M}_n \cup \{m\}$. When all the channels have been accessed, i.e., $\mathcal{M}_n = \mathcal{M}$, the stage of initial channel estimation ends. Thus, the total number of time slots for the first stage is $M t_{\max}$.

Fig. 2. Learning time structure.

In the second stage, at each period $T \geq 1$, each user $n \in \mathcal{N}$ selects a channel $m$ to access according to a mixed strategy $f_n(T) = (f_{1,n}(T), \ldots, f_{M,n}(T))$, where $f_{m,n}(T)$ is the probability of user $n$ choosing channel $m$ and is computed as
$$f_{m,n}(T) = \frac{\exp\left(\sum_{\tau=0}^{T-1} Z_{m,n}(\tau)\right)}{\sum_{i=1}^{M} \exp\left(\sum_{\tau=0}^{T-1} Z_{i,n}(\tau)\right)}, \forall m \in \mathcal{M}. \tag{10}$$
Here $Z_{m,n}(\tau)$ is user $n$'s estimation of the quality of channel $m$ at period $\tau$ (see (11) and (12) later). The update in (10) means that each user adjusts its mixed strategy according to its weighted average estimations of all channels' qualities.

Suppose that user $n$ chooses channel $m$ to access at period $T$. For each unchosen channel $m' \neq m$ in this period, user $n$ can empirically estimate the quality of that channel according to its past memories as
$$Z_{m',n}(T) = (1-\gamma)\sum_{\tau=0}^{T-1} \gamma^{T-\tau} Z_{m',n}(\tau). \tag{11}$$



For the chosen channel $m$, user $n$ will update the estimation of this channel $m$ by combining the empirical estimation with the real-time throughput measurement in this period, i.e.,
$$Z_{m,n}(T) = (1-\gamma)\left(\sum_{\tau=0}^{T-1} \gamma^{T-\tau} Z_{m,n}(\tau) + \frac{\sum_{t=1}^{t_{\max}} b_m(t) I_{\{a_n(t,T)=m\}}}{t_{\max}}\right). \tag{12}$$
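The per-user bookkeeping in (9)-(12) can be sketched as follows. This Python illustration assumes the forms of (10)-(12) as written above (exponential weighting of the accumulated estimates, and discounting by $\gamma^{T-\tau}$ scaled by $(1-\gamma)$); the function names are hypothetical, and sensing and contention are left out.

```python
import math
from typing import List

# Assumes the forms of (10)-(12) as written in the text; helper names are hypothetical.


def sample_average(rates: List[float], grabbed: List[bool]) -> float:
    """(1/t_max) * sum_t b_m(t) * I{a_n(t,T) = m}: mean throughput over one period."""
    return sum(b for b, ok in zip(rates, grabbed) if ok) / len(rates)


def mixed_strategy(Z_hist: List[List[float]]) -> List[float]:
    """Eq. (10): f_{m,n}(T) proportional to exp(sum_{tau=0}^{T-1} Z_{m,n}(tau))."""
    scores = [sum(history) for history in Z_hist]
    top = max(scores)  # shift by the max before exp() to avoid overflow
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def discounted_memory(history: List[float], gamma: float) -> float:
    """(1 - gamma) * sum_{tau=0}^{T-1} gamma^(T - tau) * Z(tau), with T = len(history)."""
    T = len(history)
    return (1.0 - gamma) * sum(gamma ** (T - tau) * z for tau, z in enumerate(history))


def update_estimates(Z_hist: List[List[float]], chosen: int,
                     measurement: float, gamma: float) -> None:
    """Append Z_{m,n}(T) for every channel: (11) for unchosen channels,
    (12) for the chosen one, whose fresh measurement gets weight (1 - gamma)."""
    for m, history in enumerate(Z_hist):
        z = discounted_memory(history, gamma)  # uses only this channel's past
        if m == chosen:
            z += (1.0 - gamma) * measurement
        history.append(z)
```

In use, each list in `Z_hist` is seeded with $Z_{m,n}(0) = (1-\gamma)\cdot$`sample_average(...)` from the initial estimation stage; in each later period the user samples a channel from `mixed_strategy(Z_hist)`, measures its throughput with `sample_average`, and calls `update_estimates`.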



B. Convergence of Learning Mechanism 

We now study the convergence of the learning mechanism. Since each user only utilizes its local estimation to adjust its mixed channel access strategy, the exact ESS is difficult to achieve due to the random estimation noise. We will show that the learning mechanism converges to the ESS on the time average.

According to the theory of stochastic approximation [25], the limiting behavior of the learning mechanism with random estimation noise can be well approximated by the corresponding mean dynamics. We thus study the mean dynamics of the learning mechanism. To proceed, we define the mapping from the mixed channel access strategies $f(T) = (f_1(T), \ldots, f_N(T))$ to the mean throughput of user $n$ choosing channel $m$ as $Q_{m,n}(f(T)) = E[U_n(m, x(T)) \mid f(T)]$. Here the expectation $E[\cdot]$ is taken with respect to the mixed strategies $f(T)$ of all users. We show that

Theorem 5. As the memory weight $\gamma \to 1$, the mean dynamics of the learning mechanism for distributed spectrum access are given as ($\forall m \in \mathcal{M}, n \in \mathcal{N}$)
$$\dot{f}_{m,n}(T) = f_{m,n}(T)\left(Q_{m,n}(f(T)) - \sum_{i=1}^{M} f_{i,n}(T) Q_{i,n}(f(T))\right), \tag{13}$$
where the derivative is with respect to period $T$.



The proof is given in Section VIII-E. Interestingly, similar to the evolutionary dynamics in (4), the learning dynamics in (13) imply that if a channel offers a user a higher throughput than the user's current average throughput (weighted by its mixed strategy), then the user will exploit that channel more often in future learning. However, the evolutionary dynamics in (4) operate at the population level with complete network information, while the learning dynamics in (13) are derived from individual local estimations. We show in Theorem 6 that the mean dynamics of the learning mechanism converge to the ESS in (7), i.e., $Q_{m,n}(f^*) = Q_{m',n}(f^*)$.

Theorem 6. As the memory weight $\gamma \to 1$, the mean dynamics of the learning mechanism for distributed spectrum access asymptotically converge to a limiting point $f^*$ such that
$$Q_{m,n}(f^*) = Q_{m',n}(f^*), \forall m, m' \in \mathcal{M}, \forall n \in \mathcal{N}. \tag{14}$$



The proof is given in Section VIII-F. Since $Q_{m,n}(f^*) = E[U_n(m, x(T)) \mid f^*]$ and the mean dynamics converge to the equilibrium $f^*$ satisfying (14) (i.e., $E[U_n(m, x(T)) \mid f^*] = E[U_n(m', x(T)) \mid f^*]$), the learning mechanism converges to the ESS in (7) (achieved by the evolutionary spectrum access mechanism) on the time average. Note that both the evolutionary spectrum access mechanism in Algorithm 1 and the learning mechanism in Algorithm 2 involve only basic arithmetic operations and random number generation over $M$ channels, and hence have a linear computational complexity of $O(M)$ per iteration. However, due to the incomplete information, the learning mechanism typically takes a longer convergence time in order to get a good estimation of the environment.

[Figure 3 plot: legend "Number of users N=100" and "Number of users N=200"; horizontal axis "Strategy Adaptation Factor α"]

Fig. 3. The iterations needed for the convergence of the evolutionary spectrum access mechanism with different choices of the strategy adaptation factor $\alpha$. The confidence interval is 95%.
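To illustrate Theorems 5 and 6, the sketch below integrates the mean dynamics (13) with a forward-Euler scheme for a toy instance: two users, two channels, hypothetical values $\theta_m B_m = 1.0$ and $1.5$, and the asymptotic grabbing probability $g(k) = 1/k$ from (5). $Q_{m,n}(f)$ is computed exactly by enumerating the other users' channel choices, which is only practical for very small $N$. This checks the mean ODE, not the stochastic learning mechanism itself, and the function names are ours.

```python
import itertools
from typing import List, Sequence


def mean_throughput(f: List[List[float]], theta_B: Sequence[float], n: int, m: int) -> float:
    """Q_{m,n}(f) = E[theta_m B_m g(k_m) | f] with g(k) = 1/k, computed exactly
    by enumerating every channel profile of the other users."""
    M = len(theta_B)
    others = [u for u in range(len(f)) if u != n]
    q = 0.0
    for profile in itertools.product(range(M), repeat=len(others)):
        prob = 1.0
        for u, c in zip(others, profile):
            prob *= f[u][c]
        k = 1 + sum(1 for c in profile if c == m)  # user n plus others on channel m
        q += prob * theta_B[m] / k
    return q


def integrate_mean_dynamics(f0: List[List[float]], theta_B: Sequence[float],
                            dt: float = 0.1, steps: int = 5000) -> List[List[float]]:
    """Forward-Euler integration of (13):
    df_{m,n}/dT = f_{m,n} * (Q_{m,n}(f) - sum_i f_{i,n} Q_{i,n}(f))."""
    f = [row[:] for row in f0]
    N, M = len(f), len(theta_B)
    for _ in range(steps):
        Q = [[mean_throughput(f, theta_B, n, m) for m in range(M)] for n in range(N)]
        f = [
            [f[n][m] + dt * f[n][m] * (Q[n][m] - sum(f[n][i] * Q[n][i] for i in range(M)))
             for m in range(M)]
            for n in range(N)
        ]
    return f


# Toy instance with hypothetical theta_m * B_m values, starting from uniform strategies.
f_star = integrate_mean_dynamics([[0.5, 0.5], [0.5, 0.5]], theta_B=[1.0, 1.5])
print([[round(p, 4) for p in row] for row in f_star])  # [[0.2, 0.8], [0.2, 0.8]]
```

For this instance, solving $Q_{1,n} = Q_{2,n}$ by hand with symmetric strategies gives $f^* = (0.2, 0.8)$ and an equal mean throughput of 0.9 on both channels, which is where the integration settles, consistent with (14).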

VI. Simulation Results 

In this section, we evaluate the proposed algorithms by simulations. We consider a cognitive radio network consisting of $M = 5$ Rayleigh fading channels. The channel idle probabilities are $\{\theta_m\}_{m=1}^{5} = \{\ldots\}$. The data rate on a channel $m$ is computed according to the Shannon capacity, i.e., $b_m = \zeta_m \log_2\left(1 + \frac{P_n h_m}{N_0}\right)$, where $\zeta_m$ is the bandwidth of channel $m$, $P_n$ is the power adopted by users, $N_0$ is the noise power, and $h_m$ is the channel gain (a realization of a random variable that follows the exponential distribution with mean $\bar{h}_m$). In the following simulations, we set $\zeta_m = 10$ MHz, $N_0 = -100$ dBm, and $P_n = 100$ mW. By choosing different mean channel gains $\bar{h}_m$, we obtain different mean data rates $B_m = E[b_m]$, which equal 15, 70, 90, 20, and 100 Mbps, respectively.
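The data-rate model above can be sketched as follows. The bandwidth, transmit power, and noise power match the stated settings ($\zeta_m = 10$ MHz, $P_n = 100$ mW, $N_0 = -100$ dBm); the mean channel gains $\bar{h}_m$ are not listed in the text, so the gain used in the example is hypothetical, as are the function names.

```python
import math
import random


def dbm_to_watts(dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 10.0 ** ((dbm - 30.0) / 10.0)


def shannon_rate(h: float, bandwidth_hz: float = 10e6, power_w: float = 0.1,
                 noise_dbm: float = -100.0) -> float:
    """b_m = zeta_m * log2(1 + P_n * h_m / N_0), in bit/s (defaults follow the text)."""
    return bandwidth_hz * math.log2(1.0 + power_w * h / dbm_to_watts(noise_dbm))


def mean_rate_mc(mean_gain: float, samples: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of B_m = E[b_m] when the gain h_m is exponential
    with mean mean_gain (Rayleigh fading). mean_gain is a hypothetical input."""
    rng = random.Random(seed)
    total = sum(shannon_rate(rng.expovariate(1.0 / mean_gain)) for _ in range(samples))
    return total / samples


# Hypothetical mean gain giving an average SNR of 0 dB.
print(shannon_rate(1e-12), mean_rate_mc(1e-12))
```

With $\bar{h}_m = 10^{-12}$ the average SNR is 0 dB, and the Monte Carlo estimate of $B_m$ comes out around 8.6 Mbps, below the 10 Mbps obtained at the mean gain, as Jensen's inequality predicts for the concave $\log_2$.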

A. Evolutionary Spectrum Access in Large User Population Case

We first study the evolutionary spectrum access mechanism with complete network information in Section IV with a large user population. We find that the convergence speed of the evolutionary spectrum access mechanism increases as the strategy adaptation factor $\alpha$ increases (see Figure 3). We set the strategy adaptation factor $\alpha = 0.5$ in the following simulations in order to better demonstrate the evolutionary dynamics. We implement the evolutionary spectrum access mechanism with the number of users $N = 100$ and $200$, respectively, in both the large and small $\lambda_{\max}$ cases.

Fig. 4. The fraction of users on each channel and the expected user payoff of accessing different channels with the number of users N = 100 and 200, respectively, and the number of backoff mini-slots $\lambda_{\max} = 100000$.

Fig. 6. The fraction of users on each channel and the expected user payoff of accessing different channels with the number of users N = 100 and 200, respectively, and the number of backoff mini-slots $\lambda_{\max} = 20$.

1) Large $\lambda_{\max}$ Case: We first consider the case where the number of backoff mini-slots $\lambda_{\max} = 100000$, which is much larger than the number of users $N$, and thus collisions in channel contention rarely occur. This case can be approximated by the asymptotic case $\lambda_{\max} = \infty$ in Section IV-C. The simulation results are shown in Figures 4 and 5. From these figures, we see that

• Fast convergence: the algorithm takes fewer than 20 iterations to converge in all cases (see Figure 4).

• Convergence to ESS: in both the $N = 100$ and $200$ cases, the algorithm converges to the ESS $x^* = \left(\frac{\theta_1 B_1}{\sum_{i=1}^M \theta_i B_i}, \ldots, \frac{\theta_M B_M}{\sum_{i=1}^M \theta_i B_i}\right)$ (see the left column of Figure 4). At the ESS $x^*$, each user achieves the same expected payoff $U_n(a_n^*, x^*) = \frac{\sum_{i=1}^M \theta_i B_i}{N}$ (see the right column of Figure 4).

• Asymptotic stability: to investigate the stability of the evolutionary spectrum access mechanism, we let a fraction of users play mutant strategies when the system is at the ESS $x^*$. At time slot $t = 30$, a fraction $\epsilon = 0.5$ or $0.9$ of the users randomly choose a new channel. The result is shown in Figure 5. We see that the algorithm is capable of recovering the ESS $x^*$ quickly after the mutation occurs. This demonstrates that the evolutionary spectrum access mechanism is robust to perturbations in the network.
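The convergence and recovery behavior described above can be reproduced qualitatively with a few lines of code. The sketch below iterates a discretized form of the mean dynamics $\dot{x}_m = \alpha x_m (U_n(m,x)/U(x) - 1)$, which the proof of Theorem 1 in the Appendix derives. It assumes the asymptotic ($\lambda_{\max} = \infty$) payoff $U_n(m,x) = \theta_m B_m/(N x_m)$. The idle probabilities `theta` are hypothetical placeholders. The mutation is modeled, as an assumption, by mixing the state with a random channel distribution.

```python
import random


def payoffs(x, theta, rates, n_users):
    """Asymptotic per-user payoff theta_m * B_m / (N * x_m) (assumed model)."""
    return [th * b / (n_users * xm) for th, b, xm in zip(theta, rates, x)]


def evolve(x, theta, rates, n_users, alpha):
    """One discretized step of x_dot_m = alpha * x_m * (U_n(m, x) / U(x) - 1)."""
    u = payoffs(x, theta, rates, n_users)
    u_avg = sum(xm * um for xm, um in zip(x, u))  # system average payoff U(x)
    return [xm + alpha * xm * (um / u_avg - 1.0) for xm, um in zip(x, u)]


def mutate(x, eps, rng):
    """A fraction eps of users re-pick a channel at random (illustrative)."""
    r = [rng.random() + 1e-9 for _ in x]
    s = sum(r)
    return [(1.0 - eps) * xm + eps * rm / s for xm, rm in zip(x, r)]


theta = [0.6, 0.5, 0.7, 0.8, 0.4]        # hypothetical idle probabilities
rates = [15.0, 70.0, 90.0, 20.0, 100.0]  # mean data rates B_m in Mbps (from the text)
n_users, alpha = 100, 0.5
rng = random.Random(1)

x = [1.0 / len(rates)] * len(rates)      # start from a uniform population state
for t in range(60):
    if t == 30:
        x = mutate(x, 0.9, rng)          # perturbation at time slot t = 30
    x = evolve(x, theta, rates, n_users, alpha)

total = sum(th * b for th, b in zip(theta, rates))
x_star = [th * b / total for th, b in zip(theta, rates)]
print("final state:", [round(v, 4) for v in x])
print("x*         :", [round(v, 4) for v in x_star])
```

Under this payoff model, each step reduces to $x_m \leftarrow x_m + \alpha(x_m^* - x_m)$. The distance to the ESS therefore shrinks by a factor $1 - \alpha$ per step, both before and after the perturbation.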

Fig. 8. Learning mechanism for distributed spectrum access with the number of users $N = 100$ and $200$, respectively, and the number of backoff mini-slots $\lambda_{\max} = 100000$.

2) Small $\lambda_{\max}$ Case: We now consider the case that the number of backoff mini-slots $\lambda_{\max} = 20$, which is smaller than the number of users $N$. In this case, severe collisions in channel contention may occur and hence reduce the data rates of all users. The results are shown in Figures 6 and 7. We see that a small $\lambda_{\max}$ leads to a system performance loss (i.e., $\sum_{n=1}^N U_n(a_n(T), x(T)) < \sum_{m=1}^M \theta_m B_m$) due to severe collisions in channel contention. However, the evolutionary spectrum access mechanism still quickly converges to the ESS, at which all users achieve the same expected throughput, and the asymptotic stability property also holds. This verifies the efficiency of the mechanism in the small $\lambda_{\max}$ case.
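At the ESS every channel yields the same expected throughput, so for a given collision model the ESS can be located numerically. The sketch below does this by bisection on the common payoff level, treating $N x_m$ as a continuous quantity. The collision function `g` used here, $g(k) = (1 - 1/\lambda_{\max})^{k-1}/k$, is a hypothetical stand-in chosen only because it decreases in $k$. It is not the contention model of the paper, and `theta` is again a placeholder.

```python
def g(k, lam_max=20):
    """Hypothetical per-user share of a channel contended by k users."""
    return (1.0 - 1.0 / lam_max) ** (k - 1) / k


def load_for_level(eff_rate, level, n_users, iters=100):
    """Fraction x in (0, 1] with eff_rate * g(N x) = level (g is decreasing)."""
    lo, hi = 1e-12, 1.0
    if eff_rate * g(n_users * hi) >= level:
        return hi  # even the whole population on this channel earns >= level
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if eff_rate * g(n_users * mid) >= level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equal_payoff_state(theta, rates, n_users, iters=200):
    """Population state at which all channels give the same expected payoff."""
    eff = [th * b for th, b in zip(theta, rates)]
    lo, hi = 0.0, max(eff) * g(n_users * 1e-12)
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        total = sum(load_for_level(e, level, n_users) for e in eff)
        if total > 1.0:
            lo = level  # level too low: channels would need more than N users
        else:
            hi = level
    level = 0.5 * (lo + hi)
    return [load_for_level(e, level, n_users) for e in eff], level


if __name__ == "__main__":
    theta = [0.6, 0.5, 0.7, 0.8, 0.4]        # hypothetical idle probabilities
    rates = [15.0, 70.0, 90.0, 20.0, 100.0]  # B_m in Mbps, from the text
    x, level = equal_payoff_state(theta, rates, n_users=100)
    print([round(v, 4) for v in x], round(level, 4))
```

The outer bisection works because the total load $\sum_m x_m(u)$ needed to hold every channel at payoff level $u$ decreases as $u$ grows. The state on the simplex is therefore found where this total equals one.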

B. Distributed Learning Mechanism in Large User Population Case

We next evaluate the learning mechanism for distributed spectrum access with a large user population. We implement the learning mechanism with the number of users $N = 100$ and $N = 200$, respectively, in both the large and small $\lambda_{\max}$ cases. We set the memory factor $\gamma = 0.99$ and the length of a decision period $t_{\max} = 100$ time slots, which provides a good estimation of the mean data rate. Figures 8 and 9 show that the time-average user distribution on the channels converges to the ESS, and that the time-average user payoff converges to the expected payoff at the ESS. Note that users achieve this result without prior knowledge of the statistics of the channels, and the number of users utilizing each channel keeps changing in the learning scheme.
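To make the per-user bookkeeping concrete, the sketch below implements one user's weight and mixed-strategy update in the form that appears in the proof in the Appendix. In that form, the weight of the chosen channel grows by $(1-\gamma)C_n(T)$, and the mixed strategy is the normalized weight vector. The class name, the uniform initial weights, and computing $C_n(T)$ as a plain mean over the $t_{\max}$ slot throughputs are illustrative assumptions, not the paper's pseudocode.

```python
import random


class ChannelLearner:
    """Hypothetical container for one user's learning state."""

    def __init__(self, n_channels, gamma=0.99, init_weight=1.0):
        self.gamma = gamma                         # memory factor
        self.weights = [init_weight] * n_channels  # A_{m,n}(T), assumed uniform start

    def strategy(self):
        """Mixed strategy f_{m,n}(T) = A_{m,n}(T) / sum_i A_{i,n}(T)."""
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def choose(self, rng):
        """Sample the channel a_n(T) for the next decision period."""
        return rng.choices(range(len(self.weights)), weights=self.weights)[0]

    def update(self, chosen, slot_throughputs):
        """Apply A_{m,n}(T+1) = A_{m,n}(T) + (1 - gamma) C_n(T) 1{a_n(T) = m}.

        slot_throughputs holds the throughput in each of the t_max slots
        (zero in slots where the user lost the contention); C_n(T) is their mean.
        """
        c = sum(slot_throughputs) / len(slot_throughputs)
        self.weights[chosen] += (1.0 - self.gamma) * c
        return c


if __name__ == "__main__":
    rng = random.Random(0)
    user = ChannelLearner(n_channels=5)
    a = user.choose(rng)
    user.update(a, [40.0] * 100)  # hypothetical slot throughputs, t_max = 100
    print(a, [round(p, 4) for p in user.strategy()])
```

Only the chosen channel's weight changes. All other probabilities shrink by the same factor, which is why the update can be written as a stochastic-approximation step on $f_{m,n}(T)$.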
Fig. 9. Learning mechanism for distributed spectrum access with the number of users $N = 100$ and $200$, respectively, and the number of backoff mini-slots $\lambda_{\max} = 20$.



C. Evolutionary Spectrum Access and Distributed Learning in Small User Population Case

We then consider the case that the user population $N$ is small. We implement the proposed evolutionary spectrum access mechanism and distributed learning mechanism with the number of users $N = 4$ and the number of backoff mini-slots $\lambda_{\max} = 20$. The results are shown in Figure 10. We see that the evolutionary spectrum access mechanism converges to the equilibrium in which channel 5 has 2 users and channels 2 and 3 each have 1 user. At the equilibrium, these 4 users achieve expected throughputs of 50, 40, 38 and 38 Mbps, respectively. It is easy to check that any user who unilaterally changes its channel selection at the equilibrium suffers a loss in throughput; hence the equilibrium is a strict Nash equilibrium. According to [22], any strict Nash equilibrium is also an ESS, and hence the convergent equilibrium is an ESS. For the distributed learning mechanism, we see that the mechanism also converges to the same equilibrium on the time average. This verifies the effectiveness of the proposed mechanisms in the small user population case.
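The strict-Nash check mentioned above amounts to comparing each user's payoff with the payoff it would get after a unilateral switch. The sketch below enumerates all such deviations for a given channel assignment. The per-channel payoff function and its numbers are illustrative assumptions, not the simulation parameters behind Figure 10. In that function, an effective rate is divided among contending users with a hypothetical collision discount.

```python
from collections import Counter


def is_strict_nash(assignment, n_channels, payoff):
    """Return True if every unilateral channel switch strictly lowers payoff.

    assignment: list with the channel index chosen by each user.
    payoff(m, k): expected per-user throughput on channel m when k users share it.
    """
    load = Counter(assignment)
    for m in assignment:
        current = payoff(m, load[m])
        for other in range(n_channels):
            # A deviating user joins `other`, raising its load by one.
            if other != m and payoff(other, load[other] + 1) >= current:
                return False
    return True


# Hypothetical model: effective rate EFF[m] shared by k users, with an
# illustrative 0.76 factor per extra contender standing in for collisions.
EFF = [9.0, 40.0, 50.0, 12.0, 100.0]


def example_payoff(m, k):
    return EFF[m] * 0.76 ** (k - 1) / k


if __name__ == "__main__":
    # Channels indexed from 0: two users on channel 5, one each on channels 2 and 3.
    print(is_strict_nash([4, 4, 1, 2], len(EFF), example_payoff))
```

Using `>=` in the comparison makes the test strict: a deviation that merely ties the current payoff already disqualifies the assignment.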

D. Performance Comparison 

To benchmark the performance of the proposed mechanisms, we compare them with the following two algorithms:

• Centralized optimization: we solve the centralized optimization problem $\max_{x} \sum_{n=1}^N U_n(a_n, x)$, i.e., find the optimal population state $x^{opt}$ that maximizes the system throughput.

• Distributed reinforcement learning: we also implement the distributed algorithm in [16] by generalizing single-agent reinforcement learning to the multi-agent setting. More specifically, each user $n$ maintains a perception value $P_m^n(T)$ to describe the performance of channel $m$, and selects channel $m$ with probability
$$f_{m,n}(T) = \frac{e^{\nu P_m^n(T)}}{\sum_{i=1}^M e^{\nu P_i^n(T)}},$$
where $\nu$ is called the temperature. Once a payoff $U_n(T)$ is received, user $n$ updates the perception value as
$$P_m^n(T+1) = (1-\mu_T)P_m^n(T) + \mu_T U_n(T) I_{\{a_n(T) = m\}},$$
where $\mu_T$ is the smoothing factor satisfying $\sum_{T=1}^{\infty} \mu_T = \infty$ and $\sum_{T=1}^{\infty} \mu_T^2 < \infty$. As shown in [16], when $\nu$ is sufficiently large, the algorithm converges to a stationary point. We hence set $\mu_T = 1/T$ and $\nu = 10$ in the simulation, which guarantees convergence and achieves good system performance.
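For reference, the baseline's two rules can be sketched as follows: the Boltzmann-type channel selection with parameter $\nu$, and the perception update with step size $\mu_T$. The function names are hypothetical. The sketch subtracts the largest perception value before exponentiating. This leaves the probabilities unchanged but avoids floating-point overflow for $\nu = 10$ and payoffs measured in Mbps.

```python
import math
import random


def selection_probabilities(perceptions, nu=10.0):
    """f_m = exp(nu * P_m) / sum_i exp(nu * P_i), computed stably."""
    top = max(perceptions)
    weights = [math.exp(nu * (p - top)) for p in perceptions]
    total = sum(weights)
    return [w / total for w in weights]


def update_perceptions(perceptions, chosen, payoff, step):
    """P_m(T+1) = (1 - mu_T) P_m(T) + mu_T * U_n(T) * 1{a_n(T) = m}.

    As the rule is written, unchosen channels also decay by (1 - mu_T).
    """
    return [(1.0 - step) * p + (step * payoff if m == chosen else 0.0)
            for m, p in enumerate(perceptions)]


if __name__ == "__main__":
    rng = random.Random(0)
    perceptions = [0.0] * 5
    for T in range(1, 4):
        probs = selection_probabilities(perceptions)
        a = rng.choices(range(5), weights=probs)[0]
        perceptions = update_perceptions(perceptions, a, payoff=1.0, step=1.0 / T)
        print(T, a, [round(p, 3) for p in perceptions])
```

The step size $\mu_T = 1/T$ satisfies both conditions above, since the harmonic series diverges while $\sum_T 1/T^2$ converges.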
Since the proposed learning mechanism in this paper converges to the same equilibrium as the evolutionary spectrum access mechanism, we only implement the evolutionary spectrum access mechanism in this experiment. The results are shown in Figure 11. Since the global optimum obtained by centralized optimization and the ESS obtained by evolutionary spectrum access are deterministic, only the confidence interval of the distributed reinforcement learning is shown here. We see that the evolutionary spectrum access mechanism achieves up to 35% performance improvement over the distributed reinforcement learning algorithm. Compared with the centralized optimization approach, the performance loss of the evolutionary spectrum access mechanism is at most 38%. When the number of users $N$ is small (e.g., $N < 50$), the performance loss is further reduced to less than 25%. Note that the solution of the centralized optimization is not incentive compatible, since it is not a Nash equilibrium and a user can improve its payoff by changing its channel selection unilaterally. In contrast, the evolutionary spectrum access mechanism achieves an ESS, which is also a (strict) Nash equilibrium and evolutionarily stable. Interestingly, the curve of the evolutionary spectrum access mechanism in Figure 11 has a local minimum at $N = 5$. This can be explained by the property of the Nash equilibrium. When $N = 4$, the four users utilize the three channels with high data rates (i.e., Channels 2, 3, and 5 in the simulation). When $N = 5$, the same three channels are utilized at the Nash equilibrium. In this case, there is a system performance loss due to more severe channel contention. However, no user at the equilibrium is willing to switch to another vacant channel, since the remaining vacant channels have low data rates and such a switch would incur a loss to the user. When $N = 8$, all given channels are utilized at the Nash equilibrium, and this improves the system performance.

E. Distributed Learning Mechanism in Markovian Channel Environment

For ease of exposition, we have considered the i.i.d. channel model so far. We now consider the proposed mechanisms in the Markovian channel environment. In the evolutionary spectrum access mechanism, each user has complete information a priori (including the stationary probability that a channel is idle). The Markovian setting therefore does not affect the evolutionary spectrum access mechanism. We hence focus on evaluating the learning mechanism.

Fig. 11. Comparison of the evolutionary spectrum access mechanism with the distributed reinforcement learning and centralized optimization. The confidence interval is 95%.

Fig. 12. Two-state Markovian channel model.

We consider a network of $N = 100$ users and $M = 10$ channels. The states of the channels change according to independent Markovian processes (see Figure 12). We denote the channel state probability vector of channel $m$ at time slot $t$ as $p_m(t) = (\Pr\{S_m(t) = 0\}, \Pr\{S_m(t) = 1\})$. It evolves as a two-state Markov chain, $p_m(t) = p_m(t-1)\Gamma$, $\forall t \geq 1$, with the transition matrix
$$\Gamma = \begin{bmatrix} 1-p & p \\ q & 1-q \end{bmatrix}.$$
For the simulation, we set $p = q = \epsilon$, where $\epsilon$ is called the dynamic factor. A larger $\epsilon$ means that the channel state changes faster over time. The mean data rates $B_m$ of the 10 channels are 10, 40, 50, 20, 80, 60, 15, 25, 30, and 70 Mbps, respectively.
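The channel-state model can be checked directly. Propagating $p_m(t) = p_m(t-1)\Gamma$ drives any initial distribution to the stationary distribution $(q/(p+q),\, p/(p+q))$. When $p = q = \epsilon$, this equals $(1/2, 1/2)$ regardless of $\epsilon$, so the dynamic factor only changes how often the state switches. The sketch below propagates the distribution and samples a state path. The function names are illustrative.

```python
import random


def step(dist, p, q):
    """One step of p(t) = p(t-1) * Gamma with Gamma = [[1-p, p], [q, 1-q]]."""
    p0, p1 = dist
    return (p0 * (1.0 - p) + p1 * q, p0 * p + p1 * (1.0 - q))


def propagate(dist, p, q, steps):
    for _ in range(steps):
        dist = step(dist, p, q)
    return dist


def sample_path(p, q, steps, rng, state=0):
    """Sample a state sequence S(0), S(1), ... of the two-state chain."""
    path = []
    for _ in range(steps):
        path.append(state)
        u = rng.random()
        if state == 0:
            state = 1 if u < p else 0   # leave state 0 with probability p
        else:
            state = 0 if u < q else 1   # leave state 1 with probability q
    return path


if __name__ == "__main__":
    for eps in (0.1, 0.3, 0.5, 0.7):  # dynamic factors used in the text
        d = propagate((1.0, 0.0), eps, eps, 50)
        path = sample_path(eps, eps, 10_000, random.Random(0))
        switches = sum(a != b for a, b in zip(path, path[1:]))
        print(eps, [round(v, 4) for v in d], switches)
```

The number of state switches grows roughly linearly with $\epsilon$, while the long-run fraction of time in each state stays near one half.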

Fig. 14. Distributed learning mechanism with the memory weight $\gamma = 0.99$ in the Markovian channel environment with different dynamic factors $\epsilon$.

We first set the dynamic factor $\epsilon = 0.3$, and study the learning mechanism with different memory weights $\gamma = 0.99, 0.8, 0.5$, and $0.1$, respectively. The results are shown in Figure 13. We see that a large enough memory weight (e.g., $\gamma > 0.8$) is needed to guarantee that the mechanism converges to the ESS. When the memory weight is large, the noise in each user's local estimation can be averaged out in the long run, and hence each user can achieve an accurate estimation of the environment. When the memory weight is small, the most recent estimations have a great impact on the learning. As a result, the learning mechanism over-exploits the current best channels and gets stuck in a local optimum.
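The role of the memory weight can be illustrated with a generic exponentially weighted estimate $\hat{z} \leftarrow \gamma \hat{z} + (1-\gamma) z$ of a noisy throughput sample $z$. Its steady-state variance is $\sigma^2 (1-\gamma)/(1+\gamma)$. A weight close to 1 therefore averages out estimation noise, while a small weight tracks the most recent samples. This is an illustrative stand-in that assumes i.i.d. Gaussian noise around a fixed mean; it is not necessarily the paper's exact estimator.

```python
import random
import statistics


def ema_trace(samples, gamma, init=0.0):
    """Exponentially weighted estimates z_hat(T) for a sequence of samples."""
    z_hat, trace = init, []
    for z in samples:
        z_hat = gamma * z_hat + (1.0 - gamma) * z
        trace.append(z_hat)
    return trace


if __name__ == "__main__":
    rng = random.Random(0)
    mean_rate, noise = 50.0, 20.0  # hypothetical throughput mean and noise level
    samples = [rng.gauss(mean_rate, noise) for _ in range(20_000)]
    for gamma in (0.99, 0.8, 0.5, 0.1):  # memory weights used in the text
        tail = ema_trace(samples, gamma, init=mean_rate)[1_000:]  # drop warm-up
        predicted = noise * ((1 - gamma) / (1 + gamma)) ** 0.5
        print(gamma, round(statistics.pstdev(tail), 2), round(predicted, 2))
```

With $\gamma = 0.99$, the estimate's standard deviation is about 7% of the raw noise level. With $\gamma = 0.1$, it is about 90%.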

We next set the memory weight $\gamma = 0.99$, and investigate the learning mechanism in Markovian channel environments with different dynamic factors $\epsilon = 0.1, 0.3, 0.5$, and $0.7$, respectively. The results are shown in Figure 14. We see that the learning mechanism converges to the ESS in all cases. This demonstrates that the learning mechanism is robust to dynamic changes of the channel states.

VII. Conclusion 

In this paper, we study the problem of distributed spectrum access of multiple time-varying heterogeneous licensed channels, and propose an evolutionary spectrum access mechanism based on evolutionary game theory. We show that the equilibrium of the mechanism is an evolutionarily stable strategy and is globally stable. We further propose a learning mechanism, which requires no information exchange among the users. We show that the learning mechanism converges to the evolutionarily stable strategy on the time average. Numerical results show that the proposed mechanisms can achieve efficient and stable spectrum sharing among the users.

One possible direction for extending this result is to consider heterogeneous users, i.e., each user may achieve different mean data rates on the same channel. Another interesting direction is to take the spatial reuse effect into account. How to design an efficient evolutionarily stable spectrum access mechanism with spatial reuse will be challenging.

VIII. Appendix 
A. Proof of Theorem 1

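Given a population state $x(t) = (x_1(t), \ldots, x_M(t))$, we divide the set of channels $\mathcal{M}$ into the following three mutually exclusive and collectively exhaustive subsets:
$$\mathcal{M}_1 = \{m \in \mathcal{M} \mid \theta_m B_m g(N x_m(t)) < U(x(t))\},$$
$$\mathcal{M}_2 = \{m \in \mathcal{M} \mid \theta_m B_m g(N x_m(t)) = U(x(t))\},$$
$$\mathcal{M}_3 = \{m \in \mathcal{M} \mid \theta_m B_m g(N x_m(t)) > U(x(t))\}.$$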

For a channel $m \in \mathcal{M}_1$, each user $n$ on this channel achieves an expected payoff lower than the system average payoff, i.e., $U_n(m, x(t)) = \theta_m B_m g(N x_m(t)) < U(x(t))$. According to the mechanism, each user has a probability of $\alpha\left(1 - \frac{U_n(m,x(t))}{U(x(t))}\right)$ of moving out of channel $m$. Since $\theta_m B_m g(N x_m(t)) < U(x(t))$, it follows that $p_m = 0$, and hence no other users will move into this channel. Thus, the dynamics are given as
$$\dot{x}_m(t) = -\alpha x_m(t)\left(1 - \frac{U_n(m,x(t))}{U(x(t))}\right) = \alpha x_m(t)\left(\frac{U_n(m,x(t))}{U(x(t))} - 1\right), \quad \forall m \in \mathcal{M}_1.$$



For a channel $m \in \mathcal{M}_2$, we have $U_n(m,x(t)) = U(x(t))$ and $p_m = 0$. Thus, $\dot{x}_m(t) = 0$, which satisfies the conclusion.

For a channel $m \in \mathcal{M}_3$, each user $n$ on this channel achieves an expected payoff higher than the system average payoff, i.e., $U_n(m,x(t)) = \theta_m B_m g(N x_m(t)) > U(x(t))$. According to the mechanism, no users will move out of channel $m$. Since $p_m > 0$, some users from the channels $m' \in \mathcal{M}_1$ will move into this channel. Let $\omega$ be the fraction of the population that carries out the movement. We have
$$\omega = \sum_{m' \in \mathcal{M}_1} \left(x_{m'}(t) - x_{m'}(t+1)\right) = \sum_{m' \in \mathcal{M}_1} \alpha x_{m'}(t)\left(1 - \frac{U_n(m',x(t))}{U(x(t))}\right) = \frac{\alpha}{U(x(t))}\sum_{m' \in \mathcal{M}_1} x_{m'}(t)\left(U(x(t)) - \theta_{m'} B_{m'} g(N x_{m'}(t))\right).$$

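We have $\theta_{m'} B_{m'} g(N x_{m'}(t)) = U(x(t))$ for each channel $m' \in \mathcal{M}_2$. We also have $\sum_{m' \in \mathcal{M}} x_{m'}(t)\left(U(x(t)) - \theta_{m'} B_{m'} g(N x_{m'}(t))\right) = 0$, because $U(x(t)) = \sum_{m' \in \mathcal{M}} x_{m'}(t)\, \theta_{m'} B_{m'} g(N x_{m'}(t))$ and $\sum_{m' \in \mathcal{M}} x_{m'}(t) = 1$. We then obtain
$$\sum_{m' \in \mathcal{M}_1} x_{m'}(t)\left(U(x(t)) - \theta_{m'} B_{m'} g(N x_{m'}(t))\right) = \sum_{m' \in \mathcal{M}_3} x_{m'}(t)\left(\theta_{m'} B_{m'} g(N x_{m'}(t)) - U(x(t))\right).$$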

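Then, the fraction of the population moving into a channel $m \in \mathcal{M}_3$ is
$$\begin{aligned}\dot{x}_m(t) = p_m \omega &= \frac{x_m(t)\left(\theta_m B_m g(N x_m(t)) - U(x(t))\right)}{\sum_{m' \in \mathcal{M}_3} x_{m'}(t)\left(\theta_{m'} B_{m'} g(N x_{m'}(t)) - U(x(t))\right)} \cdot \frac{\alpha}{U(x(t))}\sum_{m' \in \mathcal{M}_3} x_{m'}(t)\left(\theta_{m'} B_{m'} g(N x_{m'}(t)) - U(x(t))\right) \\ &= \alpha x_m(t)\left(\frac{U_n(m,x(t))}{U(x(t))} - 1\right).\end{aligned}$$
This completes the proof. □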
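The key accounting step of this proof can be spot-checked numerically: the population leaving channels in $\mathcal{M}_1$ equals the population entering channels in $\mathcal{M}_3$. The sketch below draws a random population state and random channel payoffs, both of which are arbitrary test data. It verifies the balance identity, and it checks that the drift $\dot{x}_m = \alpha x_m (U_n(m,x)/U(x) - 1)$ sums to zero, so the state stays on the simplex.

```python
import random


def drift(x, payoff, alpha=0.5):
    """x_dot_m = alpha * x_m * (U_m / U - 1) with U = sum_m x_m U_m."""
    u_avg = sum(xm * um for xm, um in zip(x, payoff))
    return [alpha * xm * (um / u_avg - 1.0) for xm, um in zip(x, payoff)]


def flow_balance(x, payoff):
    """Weighted payoff deficit over M1 and weighted payoff surplus over M3."""
    u_avg = sum(xm * um for xm, um in zip(x, payoff))
    deficit = sum(xm * (u_avg - um) for xm, um in zip(x, payoff) if um < u_avg)
    surplus = sum(xm * (um - u_avg) for xm, um in zip(x, payoff) if um > u_avg)
    return deficit, surplus


if __name__ == "__main__":
    rng = random.Random(0)
    raw = [rng.random() for _ in range(6)]
    x = [r / sum(raw) for r in raw]
    payoff = [rng.uniform(1.0, 10.0) for _ in range(6)]
    print(flow_balance(x, payoff), sum(drift(x, payoff)))
```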

B. Proof of Theorem 2 

First, it is easy to check that when $x_m(t) = x_m^* = \frac{\theta_m B_m}{\sum_{i=1}^M \theta_i B_i}$, we have $\dot{x}_m(t) = 0$ in the evolutionary dynamics. Thus, $x^*$ is an equilibrium of the evolutionary dynamics.



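We then apply Lyapunov's second method [26] to prove that the equilibrium $x^*$ is globally asymptotically stable. We use the Lyapunov function $V(x(t)) = -\sum_{m=1}^M x_m^* \ln\frac{x_m(t)}{x_m^*}$. By Jensen's inequality (the function $-\ln(\cdot)$ is strictly convex), we first have for any $x(t) \neq x^*$
$$V(x(t)) > -\ln\left(\sum_{m=1}^M x_m^* \frac{x_m(t)}{x_m^*}\right) = -\ln\left(\sum_{m=1}^M x_m(t)\right) = 0.$$
Thus, we obtain that $V(x^*) = 0$ and $V(x(t)) > 0$ for any $x(t) \neq x^*$.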

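We then consider the time derivative of $V(x(t))$:
$$\frac{dV(x(t))}{dt} = \sum_{m=1}^M \frac{\partial V(x(t))}{\partial x_m(t)}\dot{x}_m(t) = -\sum_{m=1}^M \frac{x_m^*}{x_m(t)}\dot{x}_m(t) = -\alpha\sum_{m=1}^M x_m^*\left(\frac{U_n(m,x(t))}{U(x(t))} - 1\right) = -\alpha\left(\sum_{m=1}^M \frac{1}{x_m(t)}\left(\frac{\theta_m B_m}{\sum_{i=1}^M \theta_i B_i}\right)^2 - 1\right),$$
where the last equality uses the asymptotic-case payoff $U_n(m,x(t)) = \theta_m B_m/(N x_m(t))$, so that $U(x(t)) = \sum_{i=1}^M \theta_i B_i / N$. By the Cauchy-Schwarz inequality, $\sum_{m=1}^M (x_m^*)^2/x_m(t) \geq \left(\sum_{m=1}^M x_m^*\right)^2 / \sum_{m=1}^M x_m(t) = 1$, with equality if and only if $x(t) = x^*$. Thus, we must have
$$\frac{dV(x^*)}{dt} = 0 \quad \text{and} \quad \frac{dV(x(t))}{dt} < 0 \ \text{for any } x(t) \neq x^*,$$
which completes the proof. □
According to [22], any strict Nash equilibrium is also an ESS and hence the equilibrium $x^*$ is an ESS. □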
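As a numerical sanity check of this argument, the sketch below iterates a discretized version of the dynamics. Under the payoff $\theta_m B_m/(N x_m)$ assumed for this case, the dynamics reduce to $x_m \leftarrow x_m + \alpha (x_m^* - x_m)$. The sketch confirms that $V(x) = -\sum_m x_m^* \ln(x_m/x_m^*)$ decreases monotonically to zero. The products $\theta_m B_m$ are arbitrary test values.

```python
import math


def lyapunov(x, x_star):
    """V(x) = -sum_m x*_m ln(x_m / x*_m); zero only at x = x*."""
    return -sum(s * math.log(xm / s) for xm, s in zip(x, x_star))


def trajectory_values(x, x_star, alpha=0.5, steps=40):
    values = [lyapunov(x, x_star)]
    for _ in range(steps):
        # Discretized x_dot_m = alpha * (x*_m - x_m).
        x = [xm + alpha * (s - xm) for xm, s in zip(x, x_star)]
        values.append(lyapunov(x, x_star))
    return values


if __name__ == "__main__":
    weights = [9.0, 35.0, 63.0, 16.0, 40.0]  # arbitrary theta_m * B_m values
    x_star = [w / sum(weights) for w in weights]
    print([round(v, 6) for v in trajectory_values([0.2] * 5, x_star)[:5]])
```

Because $V$ is convex and each step moves $x$ toward $x^*$ along a straight line, $V$ shrinks by at least the factor $1-\alpha$ per step in this discretized version.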

C. Proof of Theorem 4 

According to Lyapunov's second method [26], we prove the global asymptotic stability by using the Lyapunov function $V(x(t)) = L^* - L(x(t))$. Since $L^*$ is the unique global maximum of $L(x(t))$, achieved at $x^*$, we have
$$V(x(t)) > 0, \ \forall x(t) \neq x^*, \qquad V(x^*) = 0.$$

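Then, differentiating $V(x(t))$ with respect to time $t$, we have
$$\begin{aligned}\dot{V}(x(t)) &= -\sum_{m=1}^M \theta_m B_m g(N x_m(t))\,\dot{x}_m(t) = -\sum_{m=1}^M U_n(m,x(t))\,\dot{x}_m(t) \\ &= -\sum_{m=1}^M U_n(m,x(t))\frac{\alpha x_m(t)}{U(x(t))}\left(U_n(m,x(t)) - U(x(t))\right) \\ &= -\frac{\alpha}{U(x(t))}\left(\sum_{m=1}^M x_m(t)\,U_n(m,x(t))^2 - \Big(\sum_{m=1}^M x_m(t)\,U_n(m,x(t))\Big)^2\right) \\ &= -\frac{\alpha}{2U(x(t))}\sum_{m=1}^M\sum_{m'=1}^M x_m(t)\,x_{m'}(t)\left(U_n(m,x(t)) - U_n(m',x(t))\right)^2.\end{aligned}$$
Thus, we obtain that
$$\dot{V}(x(t)) < 0, \ \forall x(t) \neq x^*, \qquad \dot{V}(x^*) = 0,$$
which completes the proof. □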
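The last step of this derivation rests on an identity that holds for any $x$ on the simplex: $\sum_m x_m U_m^2 - (\sum_m x_m U_m)^2 = \frac{1}{2}\sum_m\sum_{m'} x_m x_{m'}(U_m - U_{m'})^2$. In words, the payoff variance across the population is non-negative. A quick numerical check on random test data:

```python
import random


def variance_form(x, u):
    """sum_m x_m u_m^2 - (sum_m x_m u_m)^2."""
    mean = sum(xm * um for xm, um in zip(x, u))
    return sum(xm * um * um for xm, um in zip(x, u)) - mean ** 2


def pairwise_form(x, u):
    """(1/2) sum_m sum_m' x_m x_m' (u_m - u_m')^2."""
    n = len(x)
    return 0.5 * sum(x[i] * x[j] * (u[i] - u[j]) ** 2
                     for i in range(n) for j in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    raw = [rng.random() for _ in range(5)]
    x = [r / sum(raw) for r in raw]
    u = [rng.uniform(0.0, 100.0) for _ in range(5)]
    print(variance_form(x, u), pairwise_form(x, u))
```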



D. Proof of Theorem 3

We first show that the solution $x^*$ is an equilibrium for the evolutionary dynamics. Since $U_n(m,x^*) = U_n(m',x^*)$ for any $m, m' \in \mathcal{M}$, it follows that $U(x^*) = \sum_{m'=1}^M x_{m'}^* U_n(m',x^*) = U_n(m,x^*)$ for any $m \in \mathcal{M}$. Hence
$$\dot{x}_m = \alpha x_m^*\left(\frac{U_n(m,x^*)}{U(x^*)} - 1\right) = 0,$$
which is an equilibrium for the evolutionary dynamics.

We next show that the equilibrium $x^*$ is a strict Nash equilibrium. The expected payoff of a user $n \in \mathcal{N}$ at the equilibrium population state $x^*$ is given by $U_n(a_n^*, x^*) = \theta_{a_n^*} B_{a_n^*} g(N x_{a_n^*}^*)$, where $a_n^*$ is the channel chosen by user $n$ in the population state $x^*$. Now suppose that user $n$ makes a unilateral deviation to another channel $a_n \neq a_n^*$, and the population state becomes $x' = (x_1^*, \ldots, x_{a_n^*-1}^*, x_{a_n^*}^* - \frac{1}{N}, x_{a_n^*+1}^*, \ldots, x_{a_n-1}^*, x_{a_n}^* + \frac{1}{N}, x_{a_n+1}^*, \ldots, x_M^*)$. Then its expected payoff becomes $U_n(a_n, x') = \theta_{a_n} B_{a_n} g(N x_{a_n}^* + 1) < \theta_{a_n} B_{a_n} g(N x_{a_n}^*)$. For the equilibrium $x^*$, we have $\theta_{a_n^*} B_{a_n^*} g(N x_{a_n^*}^*) = \theta_{a_n} B_{a_n} g(N x_{a_n}^*)$. It follows that $U_n(a_n^*, x^*) > U_n(a_n, x')$, $\forall a_n \neq a_n^*, n \in \mathcal{N}$, which is a strict Nash equilibrium.



E. Proof of Theorem 5

The key idea of the proof is to first obtain the discrete time dynamics of the learning mechanism, and then derive the corresponding mean continuous time dynamics. For simplicity, we first denote by $C_n(T)$ the average throughput that user $n$ receives in period $T$. It is the sample average, over the $t_{\max}$ time slots of the period, of the throughput obtained on the chosen channel $a_n(T)$. The indicator $I_{\{a_n(t,T)=a_n(T)\}}$ records whether user $n$ successfully grabs the chosen channel $a_n(T)$ in slot $t$. According to (11) and (12), we have
$$A_{m,n}(T+1) = A_{m,n}(T) + (1-\gamma)C_n(T) I_{\{a_n(T)=m\}},$$
where $I_{\{a_n(T)=m\}}$ indicates whether user $n$ chooses channel $m$ in period $T$. Then the mixed strategy update in (10) becomes
$$f_{m,n}(T+1) = \frac{A_{m,n}(T+1)}{\sum_{i=1}^M A_{i,n}(T+1)} = \frac{A_{m,n}(T) + (1-\gamma)C_n(T) I_{\{a_n(T)=m\}}}{\sum_{i=1}^M A_{i,n}(T) + (1-\gamma)C_n(T)}.$$
For the chosen channel $j$ (i.e., $I_{\{a_n(T)=j\}} = 1$), we further have
$$\begin{aligned}f_{j,n}(T+1) &= \frac{A_{j,n}(T)}{\sum_{i=1}^M A_{i,n}(T)}\cdot\frac{\sum_{i=1}^M A_{i,n}(T)}{\sum_{i=1}^M A_{i,n}(T) + (1-\gamma)C_n(T)} + \frac{(1-\gamma)C_n(T)}{\sum_{i=1}^M A_{i,n}(T) + (1-\gamma)C_n(T)} \\ &= f_{j,n}(T)\left(1 - \frac{(1-\gamma)C_n(T)}{\sum_{i=1}^M A_{i,n}(T) + (1-\gamma)C_n(T)}\right) + \frac{(1-\gamma)C_n(T)}{\sum_{i=1}^M A_{i,n}(T) + (1-\gamma)C_n(T)}.\end{aligned}$$
Let $\beta(T) = \frac{1-\gamma}{\sum_{i=1}^M A_{i,n}(T) + (1-\gamma)C_n(T)}$. This can then be expressed as
$$f_{j,n}(T+1) = f_{j,n}(T)\left(1 - \beta(T)C_n(T)\right) + \beta(T)C_n(T). \quad (17)$$
Similarly, for an unchosen channel $i$ (i.e., $I_{\{a_n(T)=i\}} = 0$),
$$f_{i,n}(T+1) = f_{i,n}(T)\left(1 - \beta(T)C_n(T)\right). \quad (18)$$
According to (17) and (18), we obtain the discrete time learning dynamics
$$f_{m,n}(T+1) - f_{m,n}(T) = \beta(T)C_n(T)\left(I_{\{a_n(T)=m\}} - f_{m,n}(T)\right). \quad (19)$$
As $\gamma \to 1$, we apply the theory of stochastic approximation (Theorem 3.2 in [23]). By that theory, the limiting behavior of the stochastic difference equations in (19) is the same as that of its mean continuous time dynamics, which are obtained by taking the expectation of the RHS of (19) with respect to $f(T)$, i.e.,
$$\begin{aligned}\dot{f}_{m,n}(T) &= E\left[C_n(T)\left(I_{\{a_n(T)=m\}} - f_{m,n}(T)\right) \,\middle|\, f(T)\right] \\ &= (1 - f_{m,n}(T))\,E[C_n(T) \mid a_n(T)=m, f(T)]\,f_{m,n}(T) - f_{m,n}(T)\sum_{i \neq m} E[C_n(T) \mid a_n(T)=i, f(T)]\,f_{i,n}(T). \quad (20)\end{aligned}$$
Since $C_n(T)$ is the sample-average estimate of the expected throughput of the chosen channel, by the central limit theorem we have $E[C_n(T) \mid a_n(T)=m, f(T)] = E[U_n(a_n(T)=m, x(T)) \mid f(T)] = Q_{m,n}(f(T))$. Then the mean dynamics in (20) can be written as
$$\dot{f}_{m,n}(T) = Q_{m,n}(f(T))(1 - f_{m,n}(T))f_{m,n}(T) - f_{m,n}(T)\sum_{i \neq m}Q_{i,n}(f(T))f_{i,n}(T) = f_{m,n}(T)\left(Q_{m,n}(f(T)) - \sum_{i=1}^M f_{i,n}(T)Q_{i,n}(f(T))\right),$$
which completes the proof. □

F. Proof of Theorem 6

We first denote the following functions:
$$\Phi(f(T)) = E\left[\sum_{i=1}^M \int_{-\infty}^{x_i(T)} \theta_i B_i g(Nz)\,dz \,\middle|\, f(T)\right],$$
$$\Phi_{m,n}(f(T)) = E\left[\sum_{i=1}^M \int_{-\infty}^{x_i(T)} \theta_i B_i g(Nz)\,dz \,\middle|\, a_n(T) = m, f(T)\right].$$
We further denote by $x_{-n}(T) = (x_m^{-n}(T), m \in \mathcal{M})$ the population state of all users other than user $n$. By considering the user distributions on the channel $m$ chosen by user $n$ and on the other channels, we then have
$$\begin{aligned}\Phi_{m,n}(f(T)) &= E\left[\sum_{i=1}^M \int_{-\infty}^{x_i(T)} \theta_i B_i g(Nz)\,dz \,\middle|\, a_n(T)=m, f(T)\right] \\ &= \sum_{x_{-n}(T)} E\left[\sum_{i \neq m}\int_{-\infty}^{x_i^{-n}(T)}\theta_i B_i g(Nz)\,dz + \int_{-\infty}^{x_m^{-n}(T)+\frac{1}{N}}\theta_m B_m g(Nz)\,dz \,\middle|\, a_n(T)=m, x_{-n}(T)\right]\Pr\{x_{-n}(T) \mid f(T)\} \\ &= \sum_{x_{-n}(T)}\left(\sum_{i \neq m}\int_{-\infty}^{x_i^{-n}(T)}\theta_i B_i g(Nz)\,dz + \int_{-\infty}^{x_m^{-n}(T)+\frac{1}{N}}\theta_m B_m g(Nz)\,dz\right)\Pr\{x_{-n}(T) \mid f(T)\}. \quad (21)\end{aligned}$$



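Similarly, we can obtain
$$\Phi_{m',n}(f(T)) = \sum_{x_{-n}(T)}\left(\sum_{i \neq m'}\int_{-\infty}^{x_i^{-n}(T)}\theta_i B_i g(Nz)\,dz + \int_{-\infty}^{x_{m'}^{-n}(T)+\frac{1}{N}}\theta_{m'} B_{m'} g(Nz)\,dz\right)\Pr\{x_{-n}(T) \mid f(T)\}. \quad (22)$$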

It follows that, since all terms with $i \notin \{m, m'\}$ cancel,
$$\Phi_{m,n}(f(T)) - \Phi_{m',n}(f(T)) = \sum_{x_{-n}(T)}\left(\int_{x_m^{-n}(T)}^{x_m^{-n}(T)+\frac{1}{N}}\theta_m B_m g(Nz)\,dz - \int_{x_{m'}^{-n}(T)}^{x_{m'}^{-n}(T)+\frac{1}{N}}\theta_{m'} B_{m'} g(Nz)\,dz\right)\Pr\{x_{-n}(T) \mid f(T)\}. \quad (23)$$



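Since $N$ is large, we obtain that for $i \in \{m, m'\}$
$$\int_{x_i^{-n}(T)}^{x_i^{-n}(T)+\frac{1}{N}}\theta_i B_i g(Nz)\,dz = \frac{1}{N}\int_{N x_i^{-n}(T)}^{N x_i^{-n}(T)+1}\theta_i B_i g(z)\,dz \approx \frac{1}{N}\theta_i B_i g(N x_i^{-n}(T)+1). \quad (24)$$
According to (23) and (24), we have
$$\begin{aligned}\Phi_{m,n}(f(T)) - \Phi_{m',n}(f(T)) &\approx \frac{1}{N}\sum_{x_{-n}(T)}\left(\theta_m B_m g(N x_m^{-n}(T)+1) - \theta_{m'} B_{m'} g(N x_{m'}^{-n}(T)+1)\right)\Pr\{x_{-n}(T) \mid f(T)\} \\ &= \frac{1}{N}\left(E[U_n(a_n(T)=m, x(T)) \mid f(T)] - E[U_n(a_n(T)=m', x(T)) \mid f(T)]\right) \\ &= \frac{1}{N}\left(Q_{m,n}(f(T)) - Q_{m',n}(f(T))\right). \quad (25)\end{aligned}$$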

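We then consider the variation of $\Phi(f(T))$ along the trajectories of the learning dynamics in (13). To do so, we differentiate $\Phi(f(T))$ with respect to time $T$. We use $\Phi(f(T)) = \sum_{m=1}^M f_{m,n}(T)\Phi_{m,n}(f(T))$, where $\Phi_{m,n}(f(T))$ does not depend on user $n$'s own mixed strategy, so that $\partial\Phi(f(T))/\partial f_{m,n}(T) = \Phi_{m,n}(f(T))$:
$$\begin{aligned}\frac{d\Phi(f(T))}{dT} &= \sum_{n \in \mathcal{N}}\sum_{m=1}^M \frac{\partial\Phi(f(T))}{\partial f_{m,n}(T)}\frac{df_{m,n}(T)}{dT} = \sum_{n \in \mathcal{N}}\sum_{m=1}^M \Phi_{m,n}(f(T))\,f_{m,n}(T)\left(Q_{m,n}(f(T)) - \sum_{i=1}^M f_{i,n}(T)Q_{i,n}(f(T))\right) \\ &= \frac{1}{2}\sum_{n \in \mathcal{N}}\sum_{m=1}^M\sum_{i=1}^M f_{m,n}(T)f_{i,n}(T)\left(Q_{m,n}(f(T)) - Q_{i,n}(f(T))\right)\left(\Phi_{m,n}(f(T)) - \Phi_{i,n}(f(T))\right) \\ &\approx \frac{1}{2N}\sum_{n \in \mathcal{N}}\sum_{m=1}^M\sum_{i=1}^M f_{m,n}(T)f_{i,n}(T)\left(Q_{m,n}(f(T)) - Q_{i,n}(f(T))\right)^2 \geq 0. \quad (26)\end{aligned}$$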



Hence $\Phi(f(T))$ is non-decreasing along the trajectories of the ODE (13). According to Theorem 2.7 in [26], the learning mechanism asymptotically converges to a limit point $f^*$ such that
$$\frac{d\Phi(f^*)}{dT} = 0, \quad (27)$$
i.e., for any $m, i \in \mathcal{M}$, $n \in \mathcal{N}$,
$$f_{m,n}^* f_{i,n}^*\left(Q_{m,n}(f^*) - Q_{i,n}(f^*)\right) = 0. \quad (28)$$
According to the mixed strategy update in (10), we know that $f_{m,n}(T) > 0$ for any $m \in \mathcal{M}$. Thus, from (28), we must have $Q_{m,n}(f^*) = Q_{i,n}(f^*)$. □
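The symmetrization step in (26) can be checked numerically on random test data. For a probability vector $f$, that step states $\sum_m \Phi_m f_m (Q_m - \sum_i f_i Q_i) = \frac{1}{2}\sum_m\sum_i f_m f_i (Q_m - Q_i)(\Phi_m - \Phi_i)$. The check also confirms the resulting sign when $\Phi_m - \Phi_i$ is proportional to $Q_m - Q_i$.

```python
import random


def direct_form(f, q, phi):
    """sum_m phi_m f_m (q_m - sum_i f_i q_i)."""
    avg_q = sum(fi * qi for fi, qi in zip(f, q))
    return sum(p * fm * (qm - avg_q) for p, fm, qm in zip(phi, f, q))


def symmetric_form(f, q, phi):
    """(1/2) sum_m sum_i f_m f_i (q_m - q_i)(phi_m - phi_i)."""
    n = len(f)
    return 0.5 * sum(f[m] * f[i] * (q[m] - q[i]) * (phi[m] - phi[i])
                     for m in range(n) for i in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    raw = [rng.random() for _ in range(4)]
    f = [r / sum(raw) for r in raw]
    q = [rng.uniform(0.0, 50.0) for _ in range(4)]
    phi = [qm / 100.0 for qm in q]  # Phi differences proportional to Q differences
    print(direct_form(f, q, phi), symmetric_form(f, q, phi))
```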